The main challenge for fine-grained few-shot image classification is to learn feature representations with higher inter-class and lower intra-class variations from only a few labelled samples. Conventional few-shot learning methods, however, cannot be naively adopted for this fine-grained setting -- a quick pilot study reveals that they in fact push for the opposite (i.e., lower inter-class variations and higher intra-class variations). To alleviate this problem, prior works predominantly use a support set to reconstruct the query image and then utilize metric learning to determine its category. Upon careful inspection, we further reveal that such unidirectional reconstruction methods only help to increase inter-class variations and are not effective in tackling intra-class variations. In this paper, we for the first time introduce a bi-reconstruction mechanism that can simultaneously accommodate inter-class and intra-class variations. In addition to using the support set to reconstruct the query set to increase inter-class variations, we further use the query set to reconstruct the support set to reduce intra-class variations. This design effectively helps the model to explore more subtle and discriminative features, which is key to the fine-grained problem at hand. Furthermore, we also construct a self-reconstruction module to work alongside the bi-directional module to make the features even more discriminative. Experimental results on three widely used fine-grained image classification datasets consistently show considerable improvements over other methods. Code is available at: https://github.com/PRIS-CV/Bi-FRN.
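The two reconstruction directions can be made concrete with a small sketch. Below is a minimal, hypothetical illustration of the bi-reconstruction idea using closed-form ridge regression as the reconstruction operator; the function names, the regulariser `lam`, and the equal weighting of the two error terms are our own simplifications, not necessarily the paper's exact formulation:

```python
import numpy as np

def reconstruct(source, target, lam=0.1):
    """Express each target feature as a ridge-regression combination of
    source features (one illustrative choice of reconstruction operator).
    source: (n_s, d), target: (n_t, d)."""
    gram = source @ source.T                                   # (n_s, n_s)
    weights = target @ source.T @ np.linalg.inv(gram + lam * np.eye(len(source)))
    return weights @ source                                    # (n_t, d)

def bi_reconstruction_distance(support, query, lam=0.1):
    """Symmetric distance combining support->query and query->support
    reconstruction errors (hypothetical equal weighting)."""
    q_hat = reconstruct(support, query, lam)   # support reconstructs query
    s_hat = reconstruct(query, support, lam)   # query reconstructs support
    return ((query - q_hat) ** 2).sum() + ((support - s_hat) ** 2).sum()
```

In a metric-learning classifier, a smaller symmetric distance to a class's support features would indicate that class; the query-to-support direction is what penalises intra-class spread.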
Recent focus in fine-grained sketch-based image retrieval (FG-SBIR) has shifted towards generalising a model to new categories without any training data. In the real world, however, a trained FG-SBIR model is often applied to both new categories and different human sketchers, i.e., different drawing styles. Although this complicates the generalisation problem, a handful of examples are fortunately typically available, making it possible to adapt the model to the new category/style. In this paper, we offer a novel perspective: instead of asking for a model that generalises, we advocate a model that adapts quickly, with just a few samples at test time (in a few-shot manner). To tackle this new problem, we introduce a novel framework based on model-agnostic meta-learning (MAML) with several key modifications: (1) treating retrieval as a task with a margin-based contrastive loss, we simplify the MAML training in the inner loop to make it more stable and tractable; (2) the margin of our contrastive loss is also meta-learned along with the rest of the model; (3) three additional regularisation losses are introduced in the outer loop to make the meta-learned FG-SBIR model more effective for category/style adaptation. Extensive experiments on public datasets show large gains over generalisation-based and zero-shot-based approaches, as well as over some strong few-shot baselines.
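As a rough illustration of the inner loop described above, the following sketch adapts a linear embedding with a few gradient steps on a margin-based triplet loss. The linear model, analytic gradients, and hyper-parameters are illustrative stand-ins for the paper's deep encoder, and the outer loop that meta-learns the margin is not shown:

```python
import numpy as np

def triplet_loss_and_grad(W, x_s, x_p, x_n, margin):
    """Margin-based triplet loss on a linear embedding W: pull the sketch
    anchor towards its matching photo, push it from a non-matching one.
    Returns the hinged loss and its analytic gradient w.r.t. W."""
    dp, dn = W @ (x_s - x_p), W @ (x_s - x_n)
    loss = margin + dp @ dp - dn @ dn
    if loss <= 0:
        return 0.0, np.zeros_like(W)
    grad = 2 * W @ (np.outer(x_s - x_p, x_s - x_p)
                    - np.outer(x_s - x_n, x_s - x_n))
    return loss, grad

def inner_loop_adapt(W, triplets, margin, lr=0.01, steps=5):
    """MAML-style inner loop: a few gradient steps on the support
    episode's contrastive loss, starting from the meta-learned W."""
    W = W.copy()
    for _ in range(steps):
        for x_s, x_p, x_n in triplets:
            _, grad = triplet_loss_and_grad(W, x_s, x_p, x_n, margin)
            W -= lr * grad
    return W
```

In full MAML, the outer loop would back-propagate through these inner steps to update both the initial `W` and `margin`.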
In this paper, we extend scene understanding to include that of human sketch. The result is a complete trilogy of scene representation from three diverse and complementary modalities -- sketch, photo, and text. Instead of learning a rigid three-way embedding and being done with it, we focus on learning a flexible joint embedding that fully supports the ``optionality'' that this complementarity brings. Our embedding supports optionality along two axes: (i) optionality across modalities -- any combination of modalities can be used as a query for downstream tasks such as retrieval; (ii) optionality across tasks -- the embedding can simultaneously be utilised for discriminative (e.g., retrieval) or generative tasks (e.g., captioning). This provides flexibility to end-users by exploiting the best of each modality, thereby serving the very purpose behind our proposal of a trilogy in the first place. First, a combination of an information bottleneck and conditional invertible neural networks disentangles the modality-specific component from the modality-agnostic one in sketch, photo, and text. Second, the modality-agnostic instances from sketch, photo, and text are synergised using a modified cross-attention. Once learned, we show our embedding can accommodate a multitude of scene-related tasks, including those enabled for the first time by the inclusion of sketch, all without any task-specific modifications.
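The second step can be pictured with plain scaled dot-product cross-attention, where tokens of one modality attend over another's; this is a generic sketch, not the paper's modified variant:

```python
import numpy as np

def cross_attention(query_feats, context_feats):
    """Scaled dot-product cross-attention: each row of query_feats
    (e.g., sketch tokens) is replaced by a softmax-weighted mixture of
    context_feats rows (e.g., photo or text tokens)."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (n_q, n_c)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)             # rows sum to 1
    return attn @ context_feats                          # (n_q, d)
```

Applying this symmetrically between every pair of modalities (sketch/photo, sketch/text, photo/text) is one plausible way the modality-agnostic instances could be synergised.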
We advance sketch research to scenes with FS-COCO, the first dataset of freehand scene sketches. With practical applications in mind, we collect sketches that convey scene content well yet can be sketched in a few minutes by a person with sketching skills. Our dataset comprises 10,000 freehand scene vector sketches with per-point spatio-temporal information, contributed by 100 non-expert individuals, offering both object- and scene-level abstraction. Each sketch is augmented with a text description. Using our dataset, we study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions. We draw insights on: (i) scene salience encoded in sketches through the temporal order of strokes; (ii) a performance comparison of image retrieval from scene sketches versus image captions; (iii) the complementarity of the information in sketches and image captions, and the potential benefit of combining the two modalities. In addition, we extend a popular LSTM-based vector sketch encoder to handle sketches of greater complexity than supported by previous work. Namely, we propose a hierarchical sketch decoder, which we leverage in a sketch-specific ``pretext'' task. Our dataset enables, for the first time, research on freehand scene sketch understanding and its practical applications.
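For context on how per-point spatio-temporal vector sketches feed an LSTM-based encoder, a common input representation is the "stroke-3" format of (dx, dy, pen-lift) steps, which preserves the temporal stroke order mentioned above. The converter below is an illustrative utility under that assumed format, not part of the dataset's tooling:

```python
import numpy as np

def to_stroke3(strokes):
    """Convert a list of absolute-coordinate strokes (each an (n, 2)
    array of points, in drawing order) into stroke-3 rows
    (dx, dy, pen_lift), where pen_lift=1 marks the last point of a
    stroke. Offsets are taken from the previous point overall, starting
    at the origin."""
    rows, prev = [], np.zeros(2)
    for stroke in strokes:
        for i, pt in enumerate(stroke):
            dx, dy = pt - prev
            rows.append([dx, dy, 1.0 if i == len(stroke) - 1 else 0.0])
            prev = pt
    return np.array(rows)
```

Cumulatively summing the (dx, dy) columns recovers the absolute points, so the representation is lossless up to the starting origin.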